update vcat to mimic Base.vcat and enhance promotion rules of mixed column type #45

cjprybol · 2017-03-31T02:59:06Z

The current vcat operates on arrays of DataTables (DataTable[]), while Base.vcat uses slurping to capture any number of inputs as a tuple of arguments to concatenate. This PR changes vcat to follow Base.vcat's lead and uses the signature vcat(dts::AbstractDataTable...). It uses @nalimilan's @generated function to implement a new kind of AbstractArray promotion rule that ideally can be tested here and transferred into Base in the near future (JuliaLang/julia#20815). It also removes the prior assumptions about how to join datatables whose columns are out of order or present in only some of the arguments.

nalimilan

Thanks, looks mostly good. Please copy the explanations to the commit message (BTW, I think by "slurping" you mean "splatting"?).

It's not completely correct to say "The current vcat operates on Arrays of DataTable", since the vararg form was also supported: what you are doing is that you remove the form taking a vector of tables.
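To make the distinction between the two call forms concrete, here is a minimal sketch in current Julia 1.x syntax (not the 0.5/0.6 syntax this PR targeted). `MiniTable` and `count_tables` are hypothetical stand-ins, not part of DataTables; the point is only that a method declared with `...` accepts any number of positional arguments, and a vector has to be splatted at the call site once the vector-taking method is gone.

```julia
# Hypothetical stand-in for a table: just an ordered list of column names.
struct MiniTable
    names::Vector{Symbol}
end

# Vararg form, mirroring the shape of Base.vcat: the arguments are collected
# into a tuple inside the function ("slurping").
count_tables(ts::MiniTable...) = length(ts)

a, b, c = MiniTable([:x]), MiniTable([:x]), MiniTable([:x])

n_positional = count_tables(a, b, c)      # three positional arguments

tables = [a, b, c]
n_splatted = count_tables(tables...)      # a vector must be splatted by the caller

# With no method taking a vector, passing the vector itself is a MethodError.
vector_form_rejected = try
    count_tables(tables)
    false
catch err
    err isa MethodError
end
```

Under these assumptions both calls see three tables, and `vector_form_rejected` ends up `true`.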

nalimilan · 2017-04-02T09:58:40Z

src/abstractdatatable/abstractdatatable.jl

+    allheaders = map(names, dts)
+    # don't vcat empty DataTables
+    notempty = find(x -> length(x) > 0, allheaders)
+    uniqueheaders = unique(allheaders[notempty])


I'm not sure about this: is there a strong reason not to check headers of empty tables? If they don't match, it could indicate a bug in the user code, and I can't find a situation where this would be useful.

Also, it's not correct to return DataTable() in that case since that doesn't preserve the headers if only empty tables are passed.

I think we agree here, but just to clarify

    julia> vcat(DataTable(), DataTable(), DataTable()) # this is correct
    0×0 DataTables.DataTable

    julia> vcat(DataTable(), DataTable(), DataTable(x=[])) # this should be an error
    0×1 DataTables.DataTable

    julia> vcat(DataTable(), DataTable(), DataTable(x=[1])) # should also be an error
    1×1 DataTables.DataTable
    │ Row │ x │
    ├─────┼───┤
    │ 1   │ 1 │

Yes, that's what I meant.
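The behaviour agreed on in this exchange could be sketched roughly as follows. This is an illustration in current Julia 1.x, not the code that was merged: `check_headers` is a hypothetical helper that takes one vector of column names per table and compares all of them, including the headers of tables with no columns.

```julia
# Hypothetical helper: throws if the headers do not all agree, without
# skipping tables that have no columns.
function check_headers(headers::Vector{Symbol}...)
    isempty(headers) && return Symbol[]
    expected = headers[1]
    for (i, h) in enumerate(headers)
        if h != expected
            throw(ArgumentError("table $i has columns $h, expected $expected"))
        end
    end
    return expected
end

# All tables empty: nothing to complain about.
all_empty = check_headers(Symbol[], Symbol[], Symbol[])

# Two empty tables plus one with a column :x: a mismatch, so an error.
mismatch_caught = try
    check_headers(Symbol[], Symbol[], [:x])
    false
catch err
    err isa ArgumentError
end
```

Treating a mismatch as an error even when some tables are empty surfaces the kind of user-code bug mentioned above instead of silently dropping a table.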

nalimilan · 2017-04-02T10:01:12Z

src/abstractdatatable/abstractdatatable.jl

+            estrings = Vector{String}(length(uniqueheaders))
+            for (i, u) in enumerate(uniqueheaders)
+                matchingloci = find(h -> u == h, allheaders)
+                headerdiff = filter(x -> !in(x, u), coldiff)


This is just setdiff(coldiff, u), right?
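A quick check of that equivalence, using made-up column names and current Julia 1.x. The two forms agree whenever `coldiff` has no repeated entries; `setdiff` additionally removes duplicates, which `filter` does not.

```julia
coldiff = [:a, :b, :c, :d]   # illustrative values, not taken from the PR
u = [:b, :d]

via_filter  = filter(x -> !in(x, u), coldiff)
via_setdiff = setdiff(coldiff, u)

# The one difference: setdiff also drops repeated elements.
dups_filter  = filter(x -> !in(x, [2]), [1, 1, 2])
dups_setdiff = setdiff([1, 1, 2], [2])
```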

nalimilan · 2017-04-02T10:02:04Z

src/abstractdatatable/abstractdatatable.jl

+            for (i, u) in enumerate(uniqueheaders)
+                matchingloci = find(h -> u == h, allheaders)
+                headerdiff = filter(x -> !in(x, u), coldiff)
+                headerdiff = join(string.(headerdiff), ", ", " and ")


Avoid reusing variable names for values of different types, this creates a type instability. Same below.

Also string isn't needed here and below. So you might be able to put the commands directly inside the string literal.
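Both suggestions can be illustrated with a small sketch in current Julia 1.x. The variable names and message text are made up for the example; the PR's actual error messages are not reproduced here.

```julia
missing_cols = [:a, :b, :c]          # stays a Vector{Symbol} throughout

# join prints each element itself, so no string.(...) call is needed, and
# the result gets its own name instead of overwriting missing_cols.
missing_str = join(missing_cols, ", ", " and ")

# The call can also sit directly inside the string literal.
msg = "column(s) $(join(missing_cols, ", ", " and ")) are missing"
```

Keeping one name per type means `missing_cols` is always a vector of symbols and `missing_str` is always a `String`, which is what the type-instability remark is about.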

nalimilan · 2017-04-02T10:03:11Z

src/abstractdatatable/abstractdatatable.jl

+            filter!(u -> Set(u) != Set(unionunique), uniqueheaders)
+            estrings = Vector{String}(length(uniqueheaders))
+            for (i, u) in enumerate(uniqueheaders)
+                matchingloci = find(h -> u == h, allheaders)


Is "loci" really useful here?

nalimilan · 2017-04-02T10:08:18Z

src/abstractdatatable/abstractdatatable.jl

+        for i in 1:length(cols)
+            data = [dt[i] for dt in dts_to_vcat]
+            res = promote_col_type(data...)(mapreduce(length, +, data))
+            cols[i] = copy!(res, vcat(data...))


Calling vcat will create a temporary array. Better to call copy! repeatedly with a changing destination offset (see the do argument of copy!).

nalimilan · 2017-04-02T10:14:51Z

src/groupeddatatable/grouping.jl

@@ -193,7 +193,7 @@ combine(map(d -> mean(dropnull(d[:c])), gd))
 """
 function combine(ga::GroupApplied)
    gd, vals = ga.gd, ga.vals
-    valscat = vcat(vals)
+    valscat = vcat(vals...)


Can this go into a separate commit?

Different commit but same PR? I can do that 👍
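The reason `combine` needs the splat can be seen with plain arrays in Base (current Julia 1.x shown): given a single vector of vectors, `vcat` has only one argument to concatenate, so nothing is flattened.

```julia
vals = [[1, 2], [3], [4, 5]]

unsplatted = vcat(vals)      # one argument: still a 3-element vector of vectors
splatted   = vcat(vals...)   # three arguments: concatenated into one vector
```

The first call returns a three-element `Vector{Vector{Int}}`, while the second returns `[1, 2, 3, 4, 5]`.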

cjprybol · 2017-04-03T05:22:34Z

Thanks for the review! I'll get to these edits soon. re: slurping/splatting, I'm not sure if I used the right one. Both descriptions are used in the docs, but I'll say something more explicit in the commit to avoid any ambiguity.

nalimilan · 2017-04-03T07:52:51Z

Funny, I didn't remember that term. It's fine then.

cjprybol · 2017-04-04T02:40:57Z

That should address all of the comments. I added more tests to assert that the array promotion was working as intended, and a couple tests for vcatting 3 or more datatables to make sure my offset copy!s were working as intended. I think I'm finally starting to get squashing/rebasing/amending commits too. Coverage drops seem unrelated and should be addressed by #31.

cjprybol · 2017-04-28T03:54:04Z

@nalimilan

nalimilan

Sorry, it's hard for me to keep up with the reviews. Should be good to go after this round.

nalimilan · 2017-05-02T20:39:15Z

src/abstractdatatable/abstractdatatable.jl

+        for i in 1:length(cols)
+            data = [dt[i] for dt in dts]
+            lens = map(length, data)
+            cols[i] = promote_col_type(data...)(reduce(+, lens))


Why not just sum(lens)? Same below.

nalimilan · 2017-05-02T20:42:01Z

src/abstractdatatable/abstractdatatable.jl

+            cols[i] = promote_col_type(data...)(reduce(+, lens))
+            copy!(cols[i], data[1])
+            for j in 2:length(data)
+                offset = reduce(+, lens[1:j-1]) + 1


offset = lens[1] for initialization and offset += lens[j] after each iteration would be simpler.
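Put together, the review suggestions (no temporary from `vcat`, `sum` for the total length, a running offset) might look like the sketch below. It is written for current Julia 1.x, where the offset form of `copy!` is spelled `copyto!(dest, doffs, src, soffs, n)`. `concat_into` is a hypothetical name, and the element type is passed in by hand rather than computed by the PR's `promote_col_type`. The offset here starts at 1 and is advanced after each copy, a small variation on the initialization suggested above.

```julia
# Concatenate columns into a preallocated destination, copying each piece
# to its place instead of building a temporary with vcat(data...).
function concat_into(::Type{T}, data::AbstractVector...) where {T}
    lens = Int[length(col) for col in data]
    dest = Vector{T}(undef, sum(lens))       # sum(lens) rather than reduce(+, lens)
    offset = 1                               # next free position in dest
    for (j, col) in enumerate(data)
        copyto!(dest, offset, col, 1, lens[j])
        offset += lens[j]                    # advance by the piece just copied
    end
    return dest
end

combined = concat_into(Float64, [1, 2], [3.5], Int[], [4, 5])
```

With three or more inputs, as in the tests mentioned earlier in the thread, the running offset avoids re-summing `lens[1:j-1]` on every iteration.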

nalimilan · 2017-05-02T20:45:02Z

test/cat.jl

+        @test eltypes(dt)[1] == Any
+        dt = vcat(DataTable([NullableArray([1])], [:x]), DataTable([[1]], [:x]))
+        @test dt == DataTable([NullableArray([1, 1])], [:x])
+        @test eltypes(dt)[1] == Nullable{Int}


Can you rather test the type of the column, to make sure we get a NullableArray rather than an Array{Nullable}? Same below for (Nullable)CategoricalArray vs. Array{CategoricalValue}.
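Why an element-type check is too weak can be shown with an analogy from Base in current Julia 1.x (`NullableArray` itself is not used here): two containers can share an `eltype` while being different array types, so only a test on the container's type tells them apart.

```julia
a = [1, 2, 3]     # Vector{Int}
r = 1:3           # UnitRange{Int}, a different container with the same eltype

same_eltype    = eltype(a) == eltype(r)    # the element type cannot tell them apart
same_container = typeof(a) == typeof(r)    # the container type can
is_plain_vector = a isa Vector{Int}        # the kind of assertion that pins it down
```

Here `same_eltype` is `true` and `same_container` is `false`, which is the gap a test on `eltypes(dt)` alone leaves open.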

nalimilan · 2017-05-02T20:47:00Z

test/grouping.jl

    @test isequal(sum(dt2[:x]), Nullable(0))
+    @test size(by(e->1, DataTable(x=Int64[1]), :x)) == (1,2)


Why add this line in particular, and check only its size? It looks out of place in this block. Either remove it or make it more complete in its own block.

cjprybol · 2017-05-02T21:15:42Z

No worries! It's not that easy to keep up with the rate at which you do manage to review either. Thanks, as always, for putting in the time to help me improve these PRs.

cjprybol · 2017-05-03T07:00:07Z

Coverage drop doesn't seem related here either. Several functions flagged by Coveralls are handled by #31

nalimilan · 2017-05-03T09:29:31Z

test/cat.jl

+        @test typeof.(dt.columns) == [NullableVector{Int}]
+        dt = vcat(DataTable([CategoricalArray([1])], [:x]), DataTable([[1]], [:x]))
+        @test dt == DataTable([CategoricalArray([1, 1])], [:x])
+        @test typeof.(dt.columns) == [CategoricalVector{Int, UInt32}]


Leave out the UInt32 (or replace it with CategoricalArrays.DefaultRefType, but probably not worth it).

nalimilan · 2017-05-03T09:31:22Z

test/grouping.jl

@@ -148,7 +148,6 @@ module TestGrouping
    dt2 = by(e->1, DataTable(x=Int64[]), :x)
    @test size(dt2) == (0,2)
    @test isequal(sum(dt2[:x]), Nullable(0))
-    @test size(by(e->1, DataTable(x=Int64[1]), :x)) == (1,2)


Should put this in the first commit.

This change is helpful to support changing vcat to be more consistent with Base.vcat, where passing an array is not allowed.

@nalimilan

This PR removes vcat support for arrays of datatables and makes the Base.vcat style of vcat(args...) the only call option. Removes assumptions for joining datatables with missing, unique, and out of order columns. vcat'ing datatables with unmatched headers results in error messages that explain how the columns are inconsistent. Uses @nalimilan's @generated function to implement a new type of AbstractArray promotion rule that improves handling of NullableArrays and CategoricalArrays. Extends vcat testing.

cjprybol · 2017-05-13T19:39:18Z

Thanks!

cjprybol changed the title ~~update vcat to better mimic Base.vcat and better preserve column type~~ update vcat to mimic Base.vcat and better preserve column type Mar 31, 2017

nalimilan reviewed Apr 2, 2017


cjprybol mentioned this pull request Apr 3, 2017

Stop auto-promoting column-types #30

Closed

cjprybol changed the title ~~update vcat to mimic Base.vcat and better preserve column type~~ update vcat to mimic Base.vcat and enhance promotion rules of mixed column type Apr 4, 2017

nalimilan reviewed May 2, 2017


nalimilan reviewed May 3, 2017


cjprybol added 2 commits May 3, 2017 10:52

Update combine to use splat ... style vcat, like in Base.vcat

dd9c3ef

This change is helpful to support changing vcat to be more consistent with Base.vcat, where passing an array is not allowed.

nalimilan merged commit 776f293 into JuliaData:master May 12, 2017

cjprybol deleted the cjp/vcat branch May 13, 2017 19:39

cjprybol mentioned this pull request Sep 12, 2017

[RFC] Base.vcat AbstractDataFrame should rely on Base.vcat for columns JuliaData/DataFrames.jl#1118

Closed

nalimilan mentioned this pull request Oct 15, 2017

Avoid calling vcat(dfs...) in combine() JuliaData/DataFrames.jl#1261

Merged

nalimilan mentioned this pull request Jan 30, 2018

Vcat automatic recognition of same columns name JuliaData/DataFrames.jl#1347

Closed
